{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas 学习日常笔记"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[TOC]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas input\\output\n",
    "### Pandas.read_csv() 读取txt数据集\n",
    "当然也可以使用 Pandas.read_table()读取txt,但是它的默认分割符号是'\\t'，所以要加参数 delimiter=','"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pd.read_table()中有种各样的参数，我会慢慢积累常用的参数\n",
    "\n",
    "1. header\n",
    "- 若header不指定，则默认文件的第一行为每列的表头名字\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "   2104  3  399900\n0  1600  3  329900\n1  2400  3  369000\n2  1416  2  232000",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>2104</th>\n      <th>3</th>\n      <th>399900</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1600</td>\n      <td>3</td>\n      <td>329900</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2400</td>\n      <td>3</td>\n      <td>369000</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1416</td>\n      <td>2</td>\n      <td>232000</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 3
    }
   ],
   "source": [
    "pd.read_csv('./ex1data2.txt', nrows=3) #nrows=3 只读取前3行"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 若header=0, 则用数据的第一行为表头的名字（如果skip_blank_lines=True，此参数将忽略带注释的行和空行 ，因此header=0表示数据的第一行而不一定是文件的第一行）\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "   2104  3  399900\n0  1600  3  329900\n1  2400  3  369000\n2  1416  2  232000",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>2104</th>\n      <th>3</th>\n      <th>399900</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1600</td>\n      <td>3</td>\n      <td>329900</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2400</td>\n      <td>3</td>\n      <td>369000</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1416</td>\n      <td>2</td>\n      <td>232000</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 4
    }
   ],
   "source": [
    "pd.read_csv('./ex1data2.txt',header=0, nrows=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 若header=None, 表示用自动分配的表头0,1,2 ... , n。(注意：对于自带有表头的数据集设置 header=None会报错)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "      0  1       2\n0  2104  3  399900\n1  1600  3  329900\n2  2400  3  369000",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>0</th>\n      <th>1</th>\n      <th>2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2104</td>\n      <td>3</td>\n      <td>399900</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1600</td>\n      <td>3</td>\n      <td>329900</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2400</td>\n      <td>3</td>\n      <td>369000</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 5
    }
   ],
   "source": [
    "pd.read_csv('./ex1data2.txt', header=None, nrows=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. prefix\n",
    "\n",
    "- prefix = str , 例： str为 'X'  则表头为 'X0','X1', ... , 'Xn'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "     X0  X1      X2\n0  2104   3  399900\n1  1600   3  329900\n2  2400   3  369000",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>X0</th>\n      <th>X1</th>\n      <th>X2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2104</td>\n      <td>3</td>\n      <td>399900</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1600</td>\n      <td>3</td>\n      <td>329900</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2400</td>\n      <td>3</td>\n      <td>369000</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 6
    }
   ],
   "source": [
    "pd.read_csv('./ex1data2.txt',header=None, prefix='X', nrows=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3.names\n",
    "- names : 可接受列表类型的值, 可选项\n",
    "将列表的值作为数据集每列的表头.如果这个文件已经包含了表头列，那么你需要设置header=0 或 None 去覆盖已存在的表头列，否在会报错。 \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "   size  num   price\n0  2104    3  399900\n1  1600    3  329900\n2  2400    3  369000",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>size</th>\n      <th>num</th>\n      <th>price</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2104</td>\n      <td>3</td>\n      <td>399900</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1600</td>\n      <td>3</td>\n      <td>329900</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2400</td>\n      <td>3</td>\n      <td>369000</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 7
    }
   ],
   "source": [
    "pd.read_csv('./ex1data2.txt',header=None, names=['size','num','price' ], nrows=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas 行列操作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "(0    2104\n 1    1600\n 2    2400\n Name: size, dtype: int64,\n 0    3\n 1    3\n 2    3\n Name: num, dtype: int64,\n 0    399900\n 1    329900\n 2    369000\n Name: price, dtype: int64)"
     },
     "metadata": {},
     "execution_count": 8
    }
   ],
   "source": [
    "df = pd.read_csv('./ex1data2.txt',header=None, names=['size','num','price' ], nrows=3)\n",
    "df_o = pd.read_csv('./ex1data2.txt',header=None, names=['size','num','price' ])\n",
    "df['size'],df['num'],df['price']\n",
    "#取size列    num列     price列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "   size  num   price\n0  2104    3  399900\n1  1600    3  329900\n2  2400    3  369000",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>size</th>\n      <th>num</th>\n      <th>price</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2104</td>\n      <td>3</td>\n      <td>399900</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1600</td>\n      <td>3</td>\n      <td>329900</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2400</td>\n      <td>3</td>\n      <td>369000</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 9
    }
   ],
   "source": [
    "df.head(4) #返回前多少行。包括表头列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "size       1600\nnum           3\nprice    329900\nName: 1, dtype: int64"
     },
     "metadata": {},
     "execution_count": 10
    }
   ],
   "source": [
    "df.loc[1] #读取 index=1的行数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "array([  1600,      3, 329900], dtype=int64)"
     },
     "metadata": {},
     "execution_count": 11
    }
   ],
   "source": [
    "df.loc[1].values #读取 index=1行的数据 并转化为 ndarray"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "(2104, 2104)"
     },
     "metadata": {},
     "execution_count": 15
    }
   ],
   "source": [
    "df.at[0,'size'] ,df.loc[0].at['size']\n",
    "# 读取单个值 [行，列标签]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "   size  num\n0  2104    3\n1  1600    3\n2  2400    3",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>size</th>\n      <th>num</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2104</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1600</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2400</td>\n      <td>3</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 31
    }
   ],
   "source": [
    "# df[['size','num']] #取前两列 标签\n",
    "# df.loc[:,['size','num']] 标签\n",
    "df.iloc[:,0:2] #索引"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas 查询"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 单条件查询\n",
    "\n",
    "可用 > < == != 来进行筛选"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "    size  num   price\n13  4478    5  699900\n19  3031    4  599000\n24  3890    3  573900\n33  3137    3  579900\n38  4215    4  549000",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>size</th>\n      <th>num</th>\n      <th>price</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>13</th>\n      <td>4478</td>\n      <td>5</td>\n      <td>699900</td>\n    </tr>\n    <tr>\n      <th>19</th>\n      <td>3031</td>\n      <td>4</td>\n      <td>599000</td>\n    </tr>\n    <tr>\n      <th>24</th>\n      <td>3890</td>\n      <td>3</td>\n      <td>573900</td>\n    </tr>\n    <tr>\n      <th>33</th>\n      <td>3137</td>\n      <td>3</td>\n      <td>579900</td>\n    </tr>\n    <tr>\n      <th>38</th>\n      <td>4215</td>\n      <td>4</td>\n      <td>549000</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 13
    }
   ],
   "source": [
    "df_o[df_o['size']>3000] #查询size>3000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 多条件查询\n",
    "\n",
    "可用 | (或)， & (且) 逻辑连接多个条件查询"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "    size  num   price\n0   2104    3  399900\n2   2400    3  369000\n29  2637    3  299900\n32  2040    4  314900\n37  2132    4  345000\n39  2162    4  287000\n41  2238    3  329900\n42  2567    4  314000",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>size</th>\n      <th>num</th>\n      <th>price</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2104</td>\n      <td>3</td>\n      <td>399900</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2400</td>\n      <td>3</td>\n      <td>369000</td>\n    </tr>\n    <tr>\n      <th>29</th>\n      <td>2637</td>\n      <td>3</td>\n      <td>299900</td>\n    </tr>\n    <tr>\n      <th>32</th>\n      <td>2040</td>\n      <td>4</td>\n      <td>314900</td>\n    </tr>\n    <tr>\n      <th>37</th>\n      <td>2132</td>\n      <td>4</td>\n      <td>345000</td>\n    </tr>\n    <tr>\n      <th>39</th>\n      <td>2162</td>\n      <td>4</td>\n      <td>287000</td>\n    </tr>\n    <tr>\n      <th>41</th>\n      <td>2238</td>\n      <td>3</td>\n      <td>329900</td>\n    </tr>\n    <tr>\n      <th>42</th>\n      <td>2567</td>\n      <td>4</td>\n      <td>314000</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 14
    }
   ],
   "source": [
    "df_o[ (df_o['size']>2000) & (df_o['price']<400000) ] #size>2000 且 price<400000"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}